{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _clean_lat_long_user_guide:\n",
    "\n",
    "Geographic Coordinates\n",
    "======================"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "------------\n",
    "\n",
    "The function :func:`clean_lat_long() <dataprep.clean.clean_lat_long.clean_lat_long>` cleans and standardizes a DataFrame column containing latitude and/or longitude coordinates. The function :func:`validate_lat_long() <dataprep.clean.clean_lat_long.validate_lat_long>` validates either a single coordinate or a column of coordinates, returning True if the value is valid, and False otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following latitude and longitude formats are supported by the `output_format` parameter:\n",
    "\n",
    "* Decimal degrees (dd): 41.5\n",
    "* Decimal degrees hemisphere (ddh): \"41.5° N\"\n",
    "* Degrees minutes (dm): \"41° 30′ N\"\n",
    "* Degrees minutes seconds (dms): \"41° 30′ 0″ N\"\n",
    "\n",
    "You can split a column of geographic coordinates into one column for latitude and another for longitude by setting the parameter ``split`` to True.\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception\n",
    "\n",
    "After cleaning, a **report** is printed that provides the following information:\n",
    "\n",
    "* How many values were cleaned (the value must have been transformed).\n",
    "* How many values could not be parsed.\n",
    "* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.\n",
    "  \n",
    "The following sections demonstrate the functionality of `clean_lat_long()` and `validate_lat_long()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example dataset with geographic coordinates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame({\n",
    "    \"lat_long\":\n",
    "    [(41.5, -81.0), \"41.5;-81.0\", \"41.5,-81.0\", \"41.5 -81.0\",\n",
    "     \"41.5° N, 81.0° W\", \"41.5 S;81.0 E\", \"-41.5 S;81.0 E\",\n",
    "     \"23 26m 22s N 23 27m 30s E\", \"23 26' 22\\\" N 23 27' 30\\\" E\",\n",
    "     \"UT: N 39°20' 0'' / W 74°35' 0''\", \"hello\", np.nan, \"NULL\"]\n",
    "})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default `clean_lat_long()`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the `output_format` parameter is set to \"dd\" (decimal degrees) and the `errors` parameter is set to \"coerce\" (set to NaN when parsing is invalid)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_lat_long\n",
    "clean_lat_long(df, \"lat_long\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note (41.5, -81.0) is considered not cleaned in the report since it's resulting format is the same as the input. Also, \"-41.5 S;81.0 E\" is invalid because if a coordinate has a hemisphere it cannot contain a negative decimal value."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Output formats\n",
    "\n",
    "This section demonstrates the supported latitudinal and longitudinal formats.\n",
    "\n",
    "### decimal degrees hemisphere (ddh)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", output_format=\"ddh\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### degrees minutes (dm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", output_format=\"dm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### degrees minutes seconds (dms)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", output_format=\"dms\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `split` parameter\n",
    "\n",
    "The split parameter adds individual columns containing the cleaned latitude and longitude values to the given DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split can be used along with different output formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", split=True, output_format=\"dm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `inplace` parameter\n",
    "This just deletes the given column from the returned dataframe. \n",
    "A new column containing cleaned coordinates is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `inplace` and `split`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, \"lat_long\", split=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Latitude and longitude coordinates in separate columns\n",
    "\n",
    "### Clean latitude or longitude coordinates individually"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"lat\": [\" 30′ 0″ E\", \"41° 30′ N\", \"41 S\", \"80\", \"hello\", \"NA\"]})\n",
    "clean_lat_long(df, lat_col=\"lat\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combine and clean separate columns\n",
    "\n",
    "Latitude and longitude values are counted separately in the report."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"lat\": [\"30° E\", \"41° 30′ N\", \"41 S\", \"80\", \"hello\", \"NA\"],\n",
    "                      \"long\": [\"30° E\", \"41° 30′ N\", \"41 W\", \"80\", \"hello\", \"NA\"]})\n",
    "clean_lat_long(df, lat_col=\"lat\", long_col=\"long\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Clean separate columns and split the output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_lat_long(df, lat_col=\"lat\", long_col=\"long\", split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. `validate_lat_long()` \n",
    "\n",
    "`validate_lat_long()` returns True when the input is a valid latitude or longitude value otherwise it returns False.\n",
    "Valid types are the same as `clean_lat_long()`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_lat_long\n",
    "print(validate_lat_long(\"41° 30′ 0″ N\"))\n",
    "print(validate_lat_long(\"41.5 S;81.0 E\"))\n",
    "print(validate_lat_long(\"-41.5 S;81.0 E\"))\n",
    "print(validate_lat_long((41.5, 81)))\n",
    "print(validate_lat_long(41.5, lat_long=False, lat=True))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"lat_long\": \n",
    "                   [(41.5, -81.0), \"41.5;-81.0\", \"41.5,-81.0\", \"41.5 -81.0\", \n",
    "                    \"41.5° N, 81.0° W\", \"-41.5 S;81.0 E\", \n",
    "                    \"23 26m 22s N 23 27m 30s E\", \"23 26' 22\\\" N 23 27' 30\\\" E\", \n",
    "                    \"UT: N 39°20' 0'' / W 74°35' 0''\", \"hello\", np.nan, \"NULL\"]\n",
    "                  })\n",
    "validate_lat_long(df[\"lat_long\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Validate only one coordinate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"lat\": \n",
    "                   [41.5, \"41.5\", \"41.5  \", \n",
    "                    \"41.5° N\", \"-41.5 S\", \n",
    "                    \"23 26m 22s N\", \"23 26' 22\\\" N\", \n",
    "                    \"UT: N 39°20' 0''\", \"hello\", np.nan, \"NULL\"]\n",
    "                  })\n",
    "validate_lat_long(df[\"lat\"], lat_long=False, lat=True)"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
